Segmenting DNA sequence into 'words'

Author

  • Wang Liang
Abstract

[Abstract] This paper presents a novel method to segment/decode DNA sequences based on a statistical language model. First, by analyzing the genomes of 12 model species, we find that most DNA “words” are 12 to 15 bp long. We then apply an unsupervised approach to build the DNA vocabulary and design a DNA sequence segmentation method. We also find that different genomes are likely to use similar ‘languages’.
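The abstract does not spell out the segmentation procedure, so the following is only a minimal sketch of one common way to segment a sequence with a statistical language model: a unigram model decoded by dynamic programming (Viterbi-style). It is not the author's method. The function `segment`, the table `toy_logprob`, and its probabilities are hypothetical, and the toy words are far shorter than the 12 to 15 bp reported above so that the example stays readable.

```python
import math


def segment(seq, word_logprob, max_len=15):
    """Split seq into vocabulary words with the highest total log probability.

    word_logprob maps each known word to its log probability. Returns a list
    of words, or None if seq cannot be covered by the vocabulary.
    """
    n = len(seq)
    best = [-math.inf] * (n + 1)  # best[i]: best score for seq[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = seq[start:end]
            if word not in word_logprob:
                continue
            score = best[start] + word_logprob[word]
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == -math.inf:
        return None
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(seq[start:end])
        end = start
    return words[::-1]


# Hypothetical toy vocabulary with made-up probabilities (assumption).
toy_logprob = {
    "ACG": math.log(0.4),
    "ACGT": math.log(0.3),
    "GA": math.log(0.2),
    "T": math.log(0.1),
}

print(segment("ACGTGA", toy_logprob))
```

Here the split into `ACGT` and `GA` scores higher than `ACG`, `T`, `GA`, because the product of word probabilities is larger. In an unsupervised setting the word probabilities themselves would have to be estimated from the genome, which this sketch leaves out.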

Similar resources

Segmenting DNA sequence into 'words' based on statistical language model

[Abstract] This paper presents a novel method to segment/decode DNA sequences based on an n-gram statistical language model. First, by analyzing the genomes of 12 model species, we find that most DNA “words” are 12 to 15 bp long. The bound of the language entropy of DNA sequences is about 1.5674 bits. After building an n-gram biological language model, we design an unsupervised ‘probability approach...

Phoneme Segmenting Alignment with the Common Core Foundational Skills

In 2006, the easyCBM reading assessment system was developed to support the progress monitoring of phoneme segmenting, letter names and sounds recognition, word reading, passage reading fluency, and comprehension skill development in elementary schools. More recently, the Common Core Standards in English Language Arts have been introduced as a framework for outlining grade-level achievement exp...

Segmenting Narrative Text into Coherent Scenes

This paper describes a quantitative indicator for segmenting narrative text into coherent scenes. The indicator, called the lexical cohesion profile (LCP), records lexical cohesiveness of words in a fixed-length window moving word by word on the text. The cohesiveness of words, which represents their coherence, is computed by spreading activation on a semantic network. The basic idea of LCP is: (1...

The Comparison of different Procedures for DNA extraction from paraffin-embedded Tissues: A commercial kit and a traditional method based on heating

Abstract. Background and objectives: Paraffin-embedded tissues and clinical samples are a valuable resource for molecular genetic studies, but the extraction of high-quality genomic DNA from these tissues is still problematic. In the present study, the efficiency of two DNA extraction protocols, a commercial kit and a traditional method based on heating and proteinase K, was compared. Mate...

MOSAIC: segmenting multiple aligned DNA sequences

MOSAIC is a set of tools for the segmentation of multiple aligned DNA sequences into homogeneous zones. The segmentation is based on the distribution of mutational events along the alignment. As an example, the analysis of one repeated sequence belonging to the subtelomeric regions of the yeast genome is presented. Availability: free access from ftp://ftp.biomath.jussieu.fr/pub/pape...

Publication date: 2012